COCA Filters: Co-occurrence Aware Bloom Filters

Authors

  • Kamran Tirdad
  • Pedram Ghodsnia
  • J. Ian Munro
  • Alejandro López-Ortiz
Abstract

We propose an indexing data structure based on a novel variation of Bloom filters. Signature files have been proposed in the past as a method for indexing large text databases, but they suffer from a high false positive rate. In this paper we introduce COCA Filters, a new type of Bloom filter that exploits the co-occurrence probability of words in documents to reduce the false positive error. We show experimentally that this technique can reduce the false positive error by up to a factor of 21 for the same index size. Furthermore, Bloom filters can be replaced by COCA Filters wherever the co-occurrence of any two members of the universe is identifiable.
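
To make the baseline concrete, the following is a minimal sketch of a standard Bloom filter used as a per-document signature, which is the kind of index the abstract says suffers from false positives. It is not the COCA construction, which the abstract does not specify. The class name `BloomFilter`, the SHA-256 double-hashing scheme, the parameters `m` and `k`, and the sample documents are all illustrative assumptions.

```python
import hashlib
import math


class BloomFilter:
    """Plain Bloom filter with m bits and k hash positions per item."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _positions(self, item: str) -> list[int]:
        # Assumption: derive k positions from two halves of one SHA-256
        # digest (double hashing). Any family of hash functions would do.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


def expected_false_positive_rate(m: int, k: int, n: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k after n insertions."""
    return (1.0 - math.exp(-k * n / m)) ** k


# Hypothetical documents: one filter per document acts as its signature.
documents = {
    "doc1": "bloom filters index large text databases",
    "doc2": "signature files suffer from false positives",
}
signatures = {}
for name, text in documents.items():
    bf = BloomFilter(m=256, k=3)
    for word in text.split():
        bf.add(word)
    signatures[name] = bf

# Documents whose signature may contain the query word.
candidates = [name for name, bf in signatures.items() if "bloom" in bf]
```

A query can return a document that does not contain the word, but it never misses one that does. The abstract's claim is that exploiting word co-occurrence lowers the rate of such spurious matches for the same index size.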

Similar resources

An Approximate Duplicate-Elimination in RFID Data Streams Based on d-Left Time Bloom Filter

Article history: received 6 March 2010; received in revised form 16 July 2011; accepted 18 July 2011; available online 31 July 2011. The RFID technology has been applied to a wide range of areas since it does not require contact in detecting RFID tags. However, due to the multiple readings in many cases in detecting an RFID tag and the deployment of multiple readers, RFID data contains many duplica...

Bloofi: Multidimensional Bloom Filters

Bloom filters are probabilistic data structures commonly used for approximate membership problems in many areas of Computer Science (networking, distributed systems, databases, etc.). With the increase in data size and distribution of data, problems arise where a large number of Bloom filters are available, and all of them need to be searched for potential matches. As an example, in a federated cl...

Proposals of Co-occurrence Frequency Image Based Filters

We have discussed how the co-occurrence frequency image (CFI), defined from the co-occurrence frequency histogram of the gray values of an image, has the potential to introduce a new scheme for image feature extraction. This paper proposes a couple of filters for image enhancement, such as sharpening and smoothing filters. These filters are very similar to but quite different from those whic...

Optimizing Learned Bloom Filters by Sandwiching

We provide a simple method for improving the performance of the recently introduced learned Bloom filters, by showing that they perform better when the learned function is sandwiched between two Bloom filters.

Bloom-Based Filters for Hierarchical Data

In this paper, we present two novel hash-based indexing structures, based on Bloom filters, called breadth and depth Bloom filters, which, in contrast to traditional hash-based indexes, are able to represent hierarchical data and support path expression queries. We describe how these structures can be used for resource discovery in peer-to-peer networks. We have implemented both structures and o...

Publication date: 2011